Introduction

Data

Simulating Data

The data used in this paper were generated specifically for this study and were not gathered at any particular community college. This was done for several reasons:

  1. Simulated data are generated with known parameters, allowing us to determine how accurately models recover true estimates
  2. Simulated data allow for greater generalizability. Data gathered at one institution may have been generated by a process that differs dramatically from other contexts
  3. Simulated data allow us to demonstrate theoretically important trends because we know all relevant parameters

Data Generating Process

The data used in this study were generated through two processes. First, the number of new enrollees in a given semester was generated at the aggregate level (i.e., the total number of new students). Second, the retention status of enrolled students in a given semester was generated at the individual level (i.e., did student n return at t + 1). The individual-level predictions were then aggregated and summed with the number of new enrollees to produce the total headcount for a given semester. This process is visualized in chart 1.
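The two generation steps and the final aggregation can be sketched in Python as follows. Here `simulate_new_enrollees` and `simulate_returns` are hypothetical stand-ins for the models described in the next two subsections, and the starting headcount and flat return rate are illustrative values only:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_new_enrollees(rng):
    # Aggregate-level stand-in: number of new students this semester.
    return int(rng.integers(50, 150))

def simulate_returns(n_enrolled, rng):
    # Individual-level stand-in: did each enrolled student return?
    # 0.72 is the mean likelihood of return from the variable table.
    return rng.random(n_enrolled) < 0.72

headcount = 100  # illustrative starting enrollment
history = [headcount]
for semester in range(6):
    returned = int(simulate_returns(headcount, rng).sum())
    new = simulate_new_enrollees(rng)
    headcount = returned + new  # total headcount for this semester
    history.append(headcount)
```

The loop makes the two-level structure explicit: retention is decided student by student, while new enrollment arrives only as a semester-level count.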

Returning Enrollees

Existing students are known at the individual level and have three observed variables: Gender, current semester credit load, and cumulative credits taken. Each student’s likelihood of returning in the subsequent semester is a linear function of their gender and curvilinear function of cumulative credits taken such that students are more likely to be retained as they approach graduation and less likely thereafter. Constant and error terms are also added.

The formula used to generate each student’s probability of returning is given below, as are the coefficients associated with each term. As noted above, I chose to generate data using a simple model for the purpose of demonstration; gender and cumulative credit load are widely cited predictors of retention, and the values chosen for them fall within reasonable bounds.

\(\text{logit}(y) = \beta_{Gender}\,\text{Gender} + \beta_{Cumulative\ Credits^2}\,\text{Cumulative Credits}^{2} + \beta_{Cumulative\ Credits}\,\text{Cumulative Credits} + C + \epsilon\)

| Variable | Level | Min | Max | Mean | Generation | Parameters |
|---|---|---|---|---|---|---|
| Gender | Binary | 0 | 1 | 0.5 | Sample (0:1) | P(1) = 0.5 |
| Credits | Interval | 1 | 21 | 9 | Sample | Truncated Normal \(\mu\) = 6, \(\sigma\) = 9 |
| Cumulative Credits | Interval | 1 | 121 | 135 | \(\sum_{(i,j) = 1}^n n_{i,j}\) | - |
| Likelihood of Return | Ratio | 0.02 | 0.83 | 0.72 | Linear Function | See Equation 1 |
| Return | Binary | 0 | 1 | 0.72 | Sample (0:1) | P(1) = Likelihood of Return |
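A minimal sketch of the returning-student generation step, assuming the coefficient values reported in the coefficients table (\(\beta_{Gender}\) = 0.1, \(\beta_{Cumulative\ Credits}\) = 0.02, C = 0.9). The squared-term coefficient is not reported separately, so the small negative value used here is an assumption chosen to produce the rise-then-fall retention pattern described above, and the truncated-normal draws are approximated by clipping:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Observed covariates (generation parameters from the variable table).
gender = rng.integers(0, 2, size=n)                        # P(1) = 0.5
credits = np.clip(rng.normal(6, 9, size=n), 1, 21)         # mu = 6, sigma = 9, clipped
cum_credits = np.clip(rng.normal(45, 30, size=n), 1, 121)  # hypothetical spread over 1..121

# Coefficients from the coefficients table; the squared-term value is an assumption.
b_gender, b_cum, b_cum_sq, c = 0.1, 0.02, -0.0002, 0.9
eps = rng.normal(0, 1, size=n)

logit = b_gender * gender + b_cum * cum_credits + b_cum_sq * cum_credits**2 + c + eps
p_return = 1 / (1 + np.exp(-logit))                # inverse logit -> likelihood of return
returned = (rng.random(n) < p_return).astype(int)  # Bernoulli draw per student
```

The inverse-logit step is what bounds each student's likelihood of return between 0 and 1 before the final Bernoulli draw.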

New Enrollees

The number of new enrollees in a given semester was generated at the aggregate level. No individual-level characteristics are known about new enrollees until their first semester. The number of new enrollees in a given semester is a linear function of change in GDP and semester; values for the former were drawn randomly from a normal distribution with \(\mu\) = 0 and \(\sigma\) = 1. Constant and error terms are also added, with the final output of the model scaled up and rounded to an integer value.

The coefficients associated with each term of the new-enrollee model, along with those of the returning-student model, are given in the table below. As with the returning students model, this model was built with an eye towards parsimony and generalizability.

| Term | Model | Value |
|---|---|---|
| \(\beta_{GDP}\) | New | 2 |
| \(\beta_{Spring\ Semester}\) | New | 3 |
| \(\beta_{Summer\ Semester}\) | New | 0 |
| \(\beta_{Fall\ Semester}\) | New | 6 |
| \(C_{New\ Students}\) | New | 0 |
| \(\varepsilon_{New\ Students}\) | New | \(N(\mu = 0, \sigma = 1)\) |
| \(\beta_{Gender}\) | Returning | 0.1 |
| \(\beta_{Cumulative\ Credits}\) | Returning | 0.02 |
| \(C_{Returning\ Students}\) | Returning | 0.9 |
| \(\varepsilon_{Returning\ Students}\) | Returning | \(N(\mu = 0, \sigma = 1)\) |
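Given the new-model coefficients above, the new-enrollee step can be sketched as follows. The scaling factor applied before rounding is not reported in the text, so `SCALE` is an assumption chosen only to yield counts of a plausible magnitude:

```python
import numpy as np

rng = np.random.default_rng(1)

# Coefficients from the table above (Model = "New").
b_gdp, b_spring, b_summer, b_fall, c_new = 2, 3, 0, 6, 0
SCALE = 25  # the scale-up factor is not reported; assumed here

def new_enrollees(semester, rng):
    """Aggregate new-student count for one semester ('spring'|'summer'|'fall')."""
    d_gdp = rng.normal(0, 1)  # change in GDP ~ N(0, 1)
    season = {"spring": b_spring, "summer": b_summer, "fall": b_fall}[semester]
    eps = rng.normal(0, 1)    # error term ~ N(0, 1)
    raw = b_gdp * d_gdp + season + c_new + eps
    # Scale up and round to an integer count (floored at zero).
    return int(max(0, round(SCALE * raw)))

counts = [new_enrollees(s, rng) for s in ("fall", "spring", "summer")]
```

Because the summer coefficient is 0 and the fall coefficient is largest, the seasonal pattern of community college intake (large fall cohorts, small summer cohorts) falls directly out of the dummy terms.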

Summary Statistics

Methods

Results

Discussion

In this paper, I have demonstrated the value of bringing theoretical knowledge to bear on predictive models of enrollment. Community colleges construct such models under unique constraints. Moreover, the financial consequences of inaccurate models are felt more acutely there than at other types of institutions of higher education. While the extant literature has sought improvements to predictive accuracy through 'brute-force' methods - e.g. increasing the number of features in models, implementing more model types, exhaustively searching across model hyper-parameters - I argue that these efforts essentially re-invent the wheel insofar as they 'rediscover' relationships that are already well established. Our theoretical knowledge of the predictors of enrollment and persistence is robust. That knowledge should be reflected in our empirical models.

Future Research